Mini Challenge 2

"University of Constance - Applied Visual Analytics"

VAST 2010 Challenge
Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:

Marc Rene Broghammer, University of Constance, broghama@inf.uni-konstanz.de

Juergen Schniertshauer, University of Constance, schniert@inf.uni-konstanz.de

Dr. Peter Bak, University of Constance, Peter.Bak@uni-konstanz.de

Tool(s):

Our project makes use of the Konstanz Information Miner (KNIME). KNIME is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models. KNIME is developed by the Chair for Bioinformatics and Information Mining at the University of Konstanz, Germany. It is based on the Eclipse platform and, through its modular design, easily extensible. When desired, custom pipeline nodes can be implemented within hours, extending KNIME to understand and provide first-class support for highly domain-specific data. KNIME also allows small code snippets to be integrated as Java and R snippets. We created the pipeline for this Mini Challenge in an iterative process and estimate our effort at approximately 20 to 30 hours, excluding the time we needed to get used to the tool.

Further analysis was carried out with the SAV framework, written by Stefan Moritz Koch as part of his Master's thesis at the Working Group for Databases, Data Analysis and Visualization at the University of Constance. The framework is currently under development and geared toward helping analysts find temporal patterns based on a combination of automatic and visual methods. We make use of the TreeMap visualization that is part of the framework to analyze issues that cannot be covered by the simple visualizations integrated into KNIME.

Video:

View the video for Mini Challenge II

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

1. Our Pipeline

Figure 1: The analytical pipeline used to solve Mini Challenge 2

Figure 1 shows the analytical pipeline we developed to solve MC2. The solution is based on the Visual Analytics pipeline. Our goal was to tightly integrate automated data mining, user interaction and visualization. Automated data mining provides the scalability necessary to handle large datasets. The user contributes human perception, flexibility, fine tuning of parameters and pattern recognition. Visualization allows us to combine both methods optimally.

We use a combination of simple visualizations and Treemaps to support human interaction. Simple visualizations enable rapid extraction of information while the Treemap supplements our approach.
The KNIME pipeline is ideally suited for the visualization and analysis requirements of task 1 as new diseases can be analyzed with little effort.

2. Data Processing

2.1 Data Integration
The challenge data is organized into two csv-files for each country. One file consists of the hospitalization data for all patients. The other file contains the data associated with the deaths of patients. The first step is to integrate the data by joining the files for each country and combining the results.
Because the 'patient-ID' in the original dataset is unique only within each country, a new 'patient-ID' key is required. The attribute 'dies' is also derived.

Table 1: Derived Table Format

patient-id	hosp. date	gender	age	syndrome	country	death date	dies
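
The integration step can be sketched outside KNIME as follows. This is a minimal Python sketch and not the KNIME workflow itself: the input column names (ID, DATE, GENDER, AGE, SYNDROME, DATE_OF_DEATH), the country-prefixed key and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def integrate(countries):
    """Join hospitalization and death records per country and combine them.

    `countries` maps a country name to a pair of CSV texts
    (hospitalizations, deaths). All input column names are assumed.
    """
    rows = []
    for country, (hosp_csv, death_csv) in countries.items():
        # Death dates keyed by the per-country patient ID.
        deaths = {r["ID"]: r["DATE_OF_DEATH"]
                  for r in csv.DictReader(io.StringIO(death_csv))}
        for r in csv.DictReader(io.StringIO(hosp_csv)):
            death_date = deaths.get(r["ID"])
            rows.append({
                # Country prefix makes the key unique across all files.
                "patient-id": f"{country}-{r['ID']}",
                "hosp. date": r["DATE"],
                "gender": r["GENDER"],
                "age": r["AGE"],
                "syndrome": r["SYNDROME"],
                "country": country,
                "death date": death_date,
                "dies": death_date is not None,
            })
    return rows


# Made-up sample data, not taken from the challenge files.
sample = {
    "Kenya": ("ID,DATE,GENDER,AGE,SYNDROME\n1,2009-05-01,F,34,vomiting\n"
              "2,2009-05-02,M,51,back pain\n",
              "ID,DATE_OF_DEATH\n1,2009-05-09\n"),
    "Syria": ("ID,DATE,GENDER,AGE,SYNDROME\n1,2009-05-03,M,27,fever\n",
              "ID,DATE_OF_DEATH\n"),
}
table = integrate(sample)
```

Both sample countries contain a patient with ID 1; the prefix keeps the two rows apart in the combined table.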

2.1.1 First temporal analysis
This data is used for a first analysis of the difference between hospitalization date and date of death. It shows that almost every patient who died had been in hospital for 8 days.
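
A check of this kind could look like the following sketch, which counts patients by the number of days between hospitalization and death. The record layout and the sample dates are assumptions for illustration, not challenge data.

```python
from collections import Counter
from datetime import date


def stay_lengths(records):
    """Count deceased patients by days between hospitalization and death.

    `records` is an iterable of (hospitalization_date, death_date) pairs;
    death_date is None for patients who survived.
    """
    counts = Counter()
    for hosp, death in records:
        if death is not None:
            counts[(death - hosp).days] += 1
    return counts


# Made-up sample records.
records = [
    (date(2009, 5, 1), date(2009, 5, 9)),
    (date(2009, 5, 2), date(2009, 5, 10)),
    (date(2009, 5, 3), None),
    (date(2009, 5, 4), date(2009, 5, 11)),
]
hist = stay_lengths(records)
```

The most frequent key of `hist` then gives the typical interval between hospitalization and death.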

2.2 Data Processing
Our data cleaning and processing phase consisted of extracting the most frequent abbreviations found in the 'syndrome' column, which we defined as strings of length <= 3. The frequencies of the abbreviations were then visualized in a histogram, and we wrote a replacement list for the frequent ones. Next, the data was transformed into a table with a uniform separation between the single symptoms in a symptoms string: we standardise the symptoms to a comma-separated form.
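
The two cleaning steps can be sketched as follows. This is an illustrative Python version, not the KNIME nodes we used: the separator characters, the regular expressions and the example replacement entry are assumptions, and the real replacement list was written by hand.

```python
import re
from collections import Counter

# Assumed separators between symptoms in the raw 'syndrome' strings.
SEPARATORS = re.compile(r"\s*(?:[;/&,]|\band\b)\s*", re.IGNORECASE)


def abbreviation_counts(syndromes):
    """Count alphabetic tokens of length <= 3 (candidate abbreviations)."""
    counts = Counter()
    for text in syndromes:
        for token in re.findall(r"[A-Za-z]+", text):
            if len(token) <= 3:
                counts[token.lower()] += 1
    return counts


def standardise(text, replacements):
    """Expand abbreviations and join the symptoms with ', '."""
    parts = [p.strip().lower() for p in SEPARATORS.split(text) if p.strip()]
    expanded = [" ".join(replacements.get(word, word) for word in part.split())
                for part in parts]
    return ", ".join(expanded)


# Hypothetical replacement entry for illustration only.
replacements = {"abd": "abdominal"}
counts = abbreviation_counts(["ABD PAIN", "abd pain; vom"])
clean = standardise("ABD PAIN; Vomiting & diarrhea", replacements)
```

Note that a length-based filter also catches short ordinary words, which is one reason the replacement list has to be reviewed manually.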

After this step, the symptoms, now standardised, are visualized once again, and we see that there are no symptoms with a number of dying patients between 392 and 3275. In further analysis, we therefore consider only the symptoms with more than 1000 deaths as important. We scan the list for equivalent notations of the same symptoms and replace them. We now have our final list of symptoms.

Figure 2: Bar Plot of all Symptoms with a mortality of more than 1000 patients
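
The counts behind a plot such as Figure 2 could be derived as in the sketch below. It assumes the standardised comma-separated symptom strings described above; the function names and the sample rows are hypothetical, and the sample uses a small threshold in place of 1000.

```python
from collections import Counter


def deaths_per_symptom(rows):
    """Count deceased patients per symptom.

    `rows` is an iterable of (syndrome, dies) pairs, where syndrome is a
    comma-separated symptom string.
    """
    counts = Counter()
    for syndrome, dies in rows:
        if not dies:
            continue
        # A set avoids counting a symptom twice for the same patient.
        for symptom in {s.strip() for s in syndrome.split(",")}:
            if symptom:
                counts[symptom] += 1
    return counts


def important(counts, threshold=1000):
    """Return the symptoms with more than `threshold` deaths, sorted by name."""
    return sorted(s for s, n in counts.items() if n > threshold)


# Made-up sample rows.
rows = [
    ("vomiting, diarrhea", True),
    ("vomiting", True),
    ("fever", False),
    ("diarrhea, vomiting", True),
    ("fever, vomiting", False),
]
counts = deaths_per_symptom(rows)
selected = important(counts, threshold=2)
```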

3. Analysis

3.1 The symptoms

Figure 3: Number of hospitalized people per day

Using a timeseries plot, we were able to detect minor symptoms from their correlation with the primary symptoms over time. In summary, we conclude that the following symptoms are caused by the virus:

Major symptoms:

Vomiting
Diarrhea
Nose Bleeding
Abdominal Pain
Back Pain

Minor symptoms:

Conjunctivitis Red
Encephalitis
Facial Swelling
Hearing Loss
Proteinuria
Tremor

The minor symptoms did not seem to be important initially but the lower curves in Figure 3 clearly follow the temporal pattern of the main symptoms visualized by the upper curve.
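
We compared the curves visually. As a numeric stand-in for that comparison, one could compute the Pearson correlation between the daily counts of a candidate symptom and those of a major symptom, as in this sketch. The technique is swapped in for illustration and the sample series are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily counts: a major symptom, a candidate that follows it at a
# lower level, and an unrelated symptom.
major = [10, 20, 40, 80, 40, 20]
candidate = [1, 2, 4, 8, 4, 2]
unrelated = [5, 4, 5, 4, 5, 4]

r_candidate = pearson(major, candidate)
r_unrelated = pearson(major, unrelated)
```

A coefficient close to 1 indicates that the candidate follows the temporal pattern of the major symptom even though its absolute counts are much lower.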

The following analysis includes all relevant explored combinations.

3.2 Mortality Rates

Figure 4: Treemap visualizing the change of mortality rates over time. Every column represents a country. Within each country, time is mapped from top to bottom and symptoms are mapped from left to right. The more red, the higher the mortality rate.

A quick look at an ordered histogram of the mortality rates shows some outliers with a 100% mortality rate, relevant symptoms with a mortality rate of approximately 10%, and many noise symptoms with 1% or less. Figure 4 shows the change of mortality rates over time. We expected a simple correlation between symptoms and mortality rate, but the TreeMap reveals that mortality rates are not constant for a given syndrome. We can also see that mortality rates do not follow a clear pattern dependent on the symptom or time. Moreover, there is the anomaly of the mortality rates peaking towards the end of the recovery period for almost every country.
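
The values mapped to color in such a Treemap can be thought of as one mortality rate per country, time bucket and symptom. The sketch below shows one way to compute them; the weekly bucket and the row layout are assumptions, and the sample rows are made up.

```python
from collections import defaultdict


def mortality_rates(rows):
    """Return {(country, bucket, symptom): deaths / patients}.

    `rows` is an iterable of (country, bucket, symptom, dies) tuples with
    one tuple per patient and symptom.
    """
    patients = defaultdict(int)
    deaths = defaultdict(int)
    for country, bucket, symptom, dies in rows:
        key = (country, bucket, symptom)
        patients[key] += 1
        if dies:
            deaths[key] += 1
    return {key: deaths[key] / patients[key] for key in patients}


# Made-up sample rows; the bucket is a week number.
rows = [
    ("Kenya", 20, "vomiting", True),
    ("Kenya", 20, "vomiting", False),
    ("Kenya", 20, "vomiting", False),
    ("Kenya", 21, "vomiting", True),
    ("Syria", 20, "fever", False),
]
rates = mortality_rates(rows)
```

Comparing the rates of one symptom across buckets shows directly whether its mortality rate is constant over time.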

3.3 Temporal Patterns and spread of the disease

The frequencies of all relevant symptoms begin to rise around April 20, peak around May 16, and level off around June 13.

Based on the onset and the peaks of the disease, we can characterize the spread as follows:

Nairobi (Kenya)
Aleppo (Syria)
Karachi (Pakistan)
Lebanon
Iran, Venezuela, Yemen, Saudi-Arabia
Colombia

We can clearly see that the spread of the disease does not depend on the physical distance between the countries. In particular, the disease does not spread from one country to its neighbouring countries: Kenya, for example, has no neighbours affected by the disease. This pattern of spread can instead be explained by infected people flying from one country to another and infecting people in the destination country.

MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

1. Our Pipeline

Figure 1: The analytical pipeline used to solve Mini Challenge 2

Figure 1 shows the analytical pipeline we developed to solve MC2. This is the same pipeline we have used to solve Task 1. The solution is based on the Visual Analytics pipeline. Our goal was to tightly integrate automated data mining, user interaction and visualization. Automated data mining provides the scalability necessary to handle large datasets. The user contributes human perception, flexibility, fine tuning of parameters and pattern recognition. Visualization allows us to combine both methods optimally.

We use a combination of simple visualizations and Treemaps to support human interaction. Simple visualizations enable rapid extraction of information while the Treemap supplements our approach.
The KNIME pipeline is ideally suited for the visualization and analysis requirements of task 1 as new diseases can be analyzed with little effort.

2. Data Processing

2.1 Data Integration
The challenge data is organized into two csv-files for each country. One file consists of the hospitalization data for all patients. The other file contains the data associated with the deaths of patients. The first step is to integrate the data by joining the files for each country and combining the results.
Because the 'patient-ID' in the original dataset is unique only within each country, a new 'patient-ID' key is required. The attribute 'dies' is also derived.

Table 1: Derived Table Format

patient-id	hosp. date	gender	age	syndrome	country	death date	dies

2.1.1 First temporal analysis
This data is used for a first analysis of the difference between hospitalization date and date of death. It shows that almost every patient who died had been in hospital for 8 days.

2.2 Data Processing
Our data cleaning and processing phase consisted of extracting the most frequent abbreviations found in the 'syndrome' column, which we defined as strings of length <= 3. The frequencies of the abbreviations were then visualized in a histogram, and we wrote a replacement list for the frequent ones. Next, the data was transformed into a table with a uniform separation between the single symptoms in a symptoms string: we standardise the symptoms to a comma-separated form.

After this step, the symptoms, now standardised, are visualized once again, and we see here too that there is a gap: no symptom has a total mortality between 392 and 3275. In further analysis, we consider only the symptoms with more than 1000 deaths as important. We scan the list for equivalent notations of the same symptoms and replace them. We now have our final list of symptoms.

Figure 2: Bar Plot of all Symptoms with a mortality of more than 1000 patients

3. Analysis

3.1 Countries to consider

Figure 3: Normalized number of hospitalized people per day

Figure 3 shows that in all countries except for Thailand and Turkey, the normalized number of hospitalized people per day exhibits the profile of an epidemic.
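
Normalizing each country's curve by its own maximum makes countries of very different size comparable in one plot. A minimal sketch of this step, with made-up counts:

```python
def normalise(daily_counts):
    """Scale a series of daily counts so that its maximum becomes 1.0."""
    peak = max(daily_counts)
    if peak == 0:
        # A series without any hospitalizations stays flat at zero.
        return [0.0] * len(daily_counts)
    return [count / peak for count in daily_counts]


# Made-up daily hospitalization counts for one country.
normalised = normalise([5, 10, 20, 10])
```

A value of 0.1 in the normalized series then corresponds to 10% of that country's maximum number of hospitalized people per day.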

3.2 Temporal Patterns between countries

We now focus our attention on the visualization in Figure 3. At first glance, it seems as if Lebanon is the origin of the disease because it starts at a comparably high level, with approximately 10% of its maximum number of hospitalized people. However, the data from Lebanon does not show a constant increase at this level of temporal detail. At a lower level of smoothing we can see that the Lebanese data is unusually "noisy" and oscillates around 10% of its maximum for the first days before it begins to rise.
We therefore consider Nairobi to be the first location infected, with other countries following, because Nairobi starts at the bottom and overtakes Lebanon early on. This hypothesis is supported by the SAV framework, where we can clearly see that Nairobi is the first to show a significant increase in deaths, with other countries following:

Figure 4: A Treemap visualization of the normalized count of deaths. Blocks represent cities, and time is represented as a progression from top to bottom. Within each block, each vertical column represents a symptom, and color represents death rate (the higher the rate, the darker the color).

A continuous rise in death rate starts on May 23 in Nairobi. This leads us to conclude that the epidemic started in Nairobi. The hypothesis is also supported by the fact that Nairobi is the first to reach its peak death rate.
The time of the onset (measured from start to peak) varies from 24 days in Aleppo to 31 days in Colombia. Most countries have an onset of 25 or 26 days (six countries in total: three with an onset of 25 days and three with an onset of 26 days). In contrast to the onset, the time needed for recovery is almost uniform for all countries except Lebanon, as their curves run in parallel.
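
The onset duration can be measured programmatically once "start" is defined. The sketch below assumes that the start is the first day on which the count reaches a small fraction of the peak; this threshold rule and its default value are assumptions for illustration, and the sample series is made up.

```python
from datetime import date, timedelta


def onset_days(series, start_fraction=0.05):
    """Days from the start of the outbreak to its peak.

    `series` is a list of (day, count) pairs sorted by day. The start is
    taken as the first day whose count reaches start_fraction of the peak
    (an assumed definition).
    """
    peak_day, peak = max(series, key=lambda pair: pair[1])
    start_day = next(day for day, count in series
                     if count >= start_fraction * peak)
    return (peak_day - start_day).days


# Made-up series of daily counts.
first = date(2009, 4, 20)
counts = [0, 1, 10, 50, 100, 60]
series = [(first + timedelta(days=i), c) for i, c in enumerate(counts)]
duration = onset_days(series)
```

With a noisy series such as the Lebanese one, the result depends strongly on the chosen fraction, so the threshold should be checked against the plot.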

3.3 An Anomaly: Countries exhibiting an uneven growth in death counts

Figure 5: Some countries exhibit a phase of slower growth in death counts

Examining the data at a lower level of temporal detail reveals two broad differences in the way the disease developed. As Figure 5 shows, in Colombia, Lebanon, Iran, Saudi-Arabia and Venezuela there is a kind of break during the onset. In all other countries there is a steady growth in death counts, even when viewed as a three-day moving average. This phase of lower death counts can also be seen in the mortality rates (see Task 1, Figure 4), which show a small valley during this period.
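
The three-day moving average, together with a simple test for a break during the onset, can be sketched as follows. The rule used in `has_break` (the smoothed curve fails to rise at some point before its peak) is an assumption made for illustration, and both sample series are made up.

```python
def moving_average(values, window=3):
    """Simple moving average over consecutive windows of `window` days."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def has_break(values, window=3):
    """True if the smoothed series stalls or drops before reaching its peak."""
    smooth = moving_average(values, window)
    peak_index = smooth.index(max(smooth))
    rising_part = smooth[:peak_index + 1]
    return any(later <= earlier
               for earlier, later in zip(rising_part, rising_part[1:]))


# Made-up daily death counts: steady growth versus growth with a break.
steady = [1, 2, 4, 7, 11, 16, 12, 8]
stalled = [1, 4, 8, 8, 5, 6, 12, 20, 15, 9]

steady_break = has_break(steady)
stalled_break = has_break(stalled)
```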